Ever since I was little, I've followed sports: collecting cards, reading up on the news, and watching as many games as I could, as most kids do. One thing that is constantly discussed in sports is data, even more so in recent years as technology and models have advanced. The prime example is the famous Moneyball A's, but these days almost any sports article will mention statistics and data. Take the 2017 Super Bowl between the Patriots and the Falcons, where the Patriots came back from a 0.02% chance of winning (according to ESPN's win-probability graph). The models, the statistics, and the unlikelihood of the comeback were talked about for months, especially in the highly data-driven NFL, where every play can be broken down and analyzed thoroughly. The same can be said for basketball, baseball, and tennis. But one sport where this approach has struggled is soccer.
Soccer has been the "problem child" of sports data science, as the game was long considered too complicated and too fluid to be analyzed. Many managers and coaches relied on instinct and feel for the game, and many still do. But slowly this has been changing. I'm a huge Liverpool fan, and earlier this year I read an article about how Liverpool has gone from mediocre over the past few years to completely dominant with the help of their analytics department (https://www.nytimes.com/2019/05/22/magazine/soccer-data-liverpool.html). Many of Liverpool's world-class bargain signings came from their analysis of data and statistics that people can't see.
This is the inspiration and the idea behind this tutorial: can a model be created to predict which players will become world-class? In this tutorial, I'll take data from FIFA's assessment of players over the past 5 years and create a model to try to predict each player's current level of play. I'll compare my model with FIFA's most recent assessment as well as the player's current in-game form. I hope that this tutorial shows fans that data can be used to help assess players, and perhaps gets the more data-driven people who aren't soccer fans to look into cracking one of the hardest sports to analyze through data.
To start, we will use several libraries to help us retrieve, visualize, and analyze the data. To name a few, Pandas and NumPy will help process the data, Matplotlib and Seaborn will be used to visualize it, and scikit-learn will be used to create and test our model.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
The first step will be to retrieve our data and to process it in a way that it can be used for our model. As stated before, in this tutorial we will be using the FIFA rating data which can be found here.
I decided to use this data as it is a very comprehensive list of players and one of the most easily obtainable. It provides the same metrics across the last few years, as FIFA hasn't changed the metrics it collects on players. Another reason is that soccer data isn't readily available. Most of it is either collected by individuals who want to experiment with it (and surfaces mostly as graphs on Twitter) or is professional and very expensive (Opta being one of the only distributors). Thus, I settled for the best I could get, which is this FIFA data.
The following data files are stored in the GitHub repository, which can be found here.
fifa15 = pd.read_csv("players_15.csv")
fifa16 = pd.read_csv("players_16.csv")
fifa17 = pd.read_csv("players_17.csv")
fifa18 = pd.read_csv("players_18.csv")
fifa19 = pd.read_csv("players_19.csv")
fifa15.head()
Above we have an example of the data from 2015 (the datasets we want all share the same layout, so there is not too much to worry about there). There is a lot of information, and most of it is data that we don't need. The most important data is the name of the player, their age, club, overall rating, and rating for each skill.
The following code keeps only that information and then removes the in-season adjustments from the ratings. For example, if a player started with an 80 in passing but improved over the course of that year/season, FIFA would update their rating by adding +1, and the data records this as a passing rating of 80+1. This isn't convenient for our purposes, so the following code turns 80+1 into simply 81.
filterpoint = ["short_name","age","club","overall","player_positions","attacking_crossing",
"attacking_finishing","attacking_heading_accuracy","attacking_short_passing",
"attacking_volleys","skill_dribbling","skill_curve","skill_fk_accuracy",
"skill_long_passing","skill_ball_control","movement_acceleration","movement_sprint_speed",
"movement_agility","movement_reactions","movement_balance","power_shot_power",
"power_jumping","power_stamina","power_strength","power_long_shots","mentality_aggression",
"mentality_interceptions","mentality_positioning","mentality_vision","mentality_penalties",
"mentality_composure","defending_marking","defending_standing_tackle",
"defending_sliding_tackle","goalkeeping_diving","goalkeeping_handling","goalkeeping_kicking"
,"goalkeeping_positioning","goalkeeping_reflexes","sofifa_id"]
fifa15 = fifa15.filter(filterpoint)
fifa16 = fifa16.filter(filterpoint)
fifa17 = fifa17.filter(filterpoint)
fifa18 = fifa18.filter(filterpoint)
fifa19 = fifa19.filter(filterpoint)
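One detail worth knowing about `DataFrame.filter` is that it silently skips any requested column that is not present, so a column absent from an older file would simply be dropped without an error. The following is a minimal sketch of a check for that, using a made-up two-column table; `missing_columns`, `toy`, and `wanted` are hypothetical names and not part of the original notebook.

```python
import pandas as pd

def missing_columns(df, wanted):
    """Return the requested column names that the dataframe does not contain."""
    return [col for col in wanted if col not in df.columns]

# Toy stand-in for one year's table (made-up data)
toy = pd.DataFrame({"short_name": ["A. Player"], "age": [20]})
wanted = ["short_name", "age", "mentality_composure"]

# filter() keeps only the columns that exist and raises no error for the rest
filtered = toy.filter(wanted)
print(list(filtered.columns))
print(missing_columns(toy, wanted))
```

Running such a check on each year's table before filtering makes it obvious when a later step would fail on a column that one year never had.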
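def filtering(df):
    for i, rows in df.iterrows():
        # Keep only the first listed position (e.g. "RW, ST" becomes "RW")
        if len(rows.iloc[4]) > 3:
            df.at[i, df.columns[4]] = rows.iloc[4][0:rows.iloc[4].find(",")]
        # Collapse rating strings such as "80+1" into "81" for every skill column
        # (the last column, sofifa_id, is skipped)
        for j in range(5, len(rows) - 1):
            value = rows.iloc[j]
            if type(value) == str and (value.find("+") != -1 or value.find("-") != -1):
                # eval is acceptable here only because the CSV is a trusted file
                df.at[i, df.columns[j]] = str(eval(value))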
filtering(fifa15)
filtering(fifa16)
filtering(fifa17)
filtering(fifa18)
filtering(fifa19)
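The `eval` call inside `filtering` works on this trusted CSV, but it would execute any expression it is handed. As an alternative, here is a minimal sketch that parses the same "80+1" style strings with a regular expression. `parse_rating` and the sample table are hypothetical, and unlike `filtering` this version returns integers rather than strings.

```python
import re
import pandas as pd

# Matches "80", "80+1" or "80-2" (optional surrounding whitespace)
RATING_PATTERN = re.compile(r"\s*(\d+)\s*(?:([+-])\s*(\d+))?\s*")

def parse_rating(value):
    """Turn '80+1' into 81 and '75-2' into 73; leave anything else untouched."""
    if not isinstance(value, str):
        return value
    match = RATING_PATTERN.fullmatch(value)
    if match is None:
        return value
    base, sign, delta = match.groups()
    if sign is None:
        return int(base)
    return int(base) + int(delta) if sign == "+" else int(base) - int(delta)

# Made-up sample column in the same style as the FIFA skill columns
sample = pd.DataFrame({"skill_dribbling": ["80+1", "75-2", "66"]})
sample["skill_dribbling"] = sample["skill_dribbling"].map(parse_rating)
print(sample)
```

Because the function is applied with `map`, it can be run over a whole column at once instead of visiting each cell in a Python loop.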
fifa15["year"] = 2015
fifa16["year"] = 2016
fifa17["year"] = 2017
fifa18["year"] = 2018
fifa19["year"] = 2019
fifa15.head()
The above table is the final set of data that we want from each year's dataset. Although there are 39 different columns, the data we need from each is fairly simple:
fifa20 = pd.read_csv("players_20.csv")
fifa20 = (fifa20.filter(filterpoint))
filtering(fifa20)
fifa20["year"] = 2020
fifa20.head()
This table bundles up the previous processing and tidying neatly. It is also the final table that we will compare against: the FIFA 20 data, which FIFA is still updating week by week.
Although we've done some processing and tidying, the next step splits the data we have so that we can later create a more accurate model with far fewer random variables to account for.
With all the various skills listed in FIFA, it would seem impossible for any one player to be a master of all of them, and that is correct. To assess a player's overall rating, FIFA looks only at the skills it has deemed necessary for the player's position. FIFA assigns different skills to different positions and weighs each skill by importance to produce the final overall rating. To compare overall ratings across years against a single standard, I will re-evaluate the years 2015-2018, which used a different calculation for the overall rating, with the standard used in 2019 and 2020. Many people have experimented with player values to find the exact coefficients, which is explained in more detail here. These are the standards that will be used.
As a few examples:
# Attribute weights per position, following the 2019/2020 standard described above
POSITION_WEIGHTS = {
    "GK": {"goalkeeping_diving": .24, "goalkeeping_handling": .22, "goalkeeping_positioning": .22,
           "goalkeeping_reflexes": .22, "movement_reactions": .06, "goalkeeping_kicking": .04},
    "CB": {"defending_marking": .15, "defending_standing_tackle": .15, "defending_sliding_tackle": .15,
           "attacking_heading_accuracy": .10, "power_strength": .10, "mentality_aggression": .08,
           "mentality_interceptions": .08, "attacking_short_passing": .05, "movement_reactions": .05,
           "power_jumping": .04, "skill_ball_control": .05},
    "RB": {"defending_marking": .10, "defending_standing_tackle": .12, "defending_sliding_tackle": .13,
           "attacking_heading_accuracy": .07, "power_stamina": .08, "mentality_aggression": .05,
           "attacking_crossing": .07, "mentality_interceptions": .12, "attacking_short_passing": .06,
           "movement_sprint_speed": .05, "movement_reactions": .08, "skill_ball_control": .07},
    "RWB": {"defending_marking": .09, "defending_standing_tackle": .11, "defending_sliding_tackle": .10,
            "power_stamina": .08, "attacking_crossing": .10, "skill_dribbling": .07,
            "movement_agility": .03, "mentality_interceptions": .10, "attacking_short_passing": .10,
            "movement_sprint_speed": .04, "movement_reactions": .08, "skill_ball_control": .10},
    "CM": {"skill_long_passing": .13, "power_stamina": .06, "mentality_vision": .12,
           "power_long_shots": .05, "skill_dribbling": .09, "defending_standing_tackle": .06,
           "mentality_interceptions": .08, "attacking_short_passing": .15, "movement_reactions": .08,
           "mentality_positioning": .08, "skill_ball_control": .10},
    "CDM": {"skill_long_passing": .11, "power_stamina": .06, "defending_marking": .10,
            "power_strength": .06, "defending_standing_tackle": .10, "mentality_interceptions": .12,
            "attacking_short_passing": .13, "mentality_vision": .08, "movement_reactions": .09,
            "mentality_aggression": .05, "skill_ball_control": .09},
    "CAM": {"movement_agility": .04, "movement_acceleration": .04, "mentality_vision": .16,
            "power_long_shots": .06, "skill_dribbling": .11, "attacking_finishing": .05,
            "attacking_short_passing": .16, "power_shot_power": .05, "movement_reactions": .08,
            "mentality_positioning": .12, "skill_ball_control": .13},
    "RM": {"movement_agility": .03, "movement_acceleration": .05, "mentality_vision": .08,
           "skill_long_passing": .08, "skill_dribbling": .14, "power_stamina": .05,
           "attacking_crossing": .14, "attacking_short_passing": .12, "movement_sprint_speed": .05,
           "movement_reactions": .07, "mentality_positioning": .07, "skill_ball_control": .12},
    "RW": {"power_shot_power": .10, "movement_acceleration": .04, "mentality_vision": .05,
           "power_long_shots": .10, "skill_dribbling": .11, "attacking_heading_accuracy": .05,
           "attacking_crossing": .16, "attacking_short_passing": .06, "movement_sprint_speed": .04,
           "movement_reactions": .10, "mentality_positioning": .12, "skill_ball_control": .11},
    "CF": {"power_shot_power": .10, "movement_acceleration": .04, "mentality_vision": .05,
           "power_long_shots": .10, "skill_dribbling": .11, "attacking_heading_accuracy": .05,
           "attacking_finishing": .12, "attacking_short_passing": .06, "movement_sprint_speed": .04,
           "movement_reactions": .10, "mentality_positioning": .12, "skill_ball_control": .11},
    "ST": {"power_shot_power": .10, "movement_acceleration": .05, "attacking_volleys": .05,
           "power_long_shots": .05, "skill_dribbling": .08, "attacking_heading_accuracy": .10,
           "attacking_finishing": .20, "power_strength": .03, "movement_sprint_speed": .04,
           "movement_reactions": .10, "mentality_positioning": .12, "skill_ball_control": .08},
}
# Mirrored positions share the weights of their counterpart
for mirror, base in [("LB", "RB"), ("LWB", "RWB"), ("LM", "RM"), ("LW", "RW"), ("RF", "CF"), ("LF", "CF")]:
    POSITION_WEIGHTS[mirror] = POSITION_WEIGHTS[base]

def SortPositions(df):
    # Recompute "overall" (column index 3) as the weighted sum for the player's position;
    # positions without a weight table are left unchanged
    for i, rows in df.iterrows():
        weights = POSITION_WEIGHTS.get(rows["player_positions"])
        if weights is not None:
            df.at[i, df.columns[3]] = sum(float(rows[skill]) * w for skill, w in weights.items())
SortPositions(fifa15)
SortPositions(fifa16)
SortPositions(fifa17)
SortPositions(fifa18)
fifa15 = fifa15.sort_values(by="overall",ascending = False)
fifa16 = fifa16.sort_values(by="overall",ascending = False)
fifa17 = fifa17.sort_values(by="overall",ascending = False)
fifa18 = fifa18.sort_values(by="overall",ascending = False)
fifa15.head()
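Since each position's overall rating is a weighted average of skills, a quick sanity check is that the weights for a position add up to 1. The following is a minimal sketch of that check using the goalkeeper weights from `SortPositions`; the helper names and the example player are made up for illustration.

```python
import math

# Goalkeeper weights as used in SortPositions
GK_WEIGHTS = {
    "goalkeeping_diving": 0.24,
    "goalkeeping_handling": 0.22,
    "goalkeeping_positioning": 0.22,
    "goalkeeping_reflexes": 0.22,
    "movement_reactions": 0.06,
    "goalkeeping_kicking": 0.04,
}

def weights_sum_to_one(weights, tol=1e-9):
    """True when the weights form a proper weighted average."""
    return math.isclose(sum(weights.values()), 1.0, abs_tol=tol)

def weighted_overall(attributes, weights):
    """Weighted sum of the listed attributes for one player."""
    return sum(float(attributes[name]) * w for name, w in weights.items())

# Hypothetical keeper rated 80 in every relevant skill
example_keeper = {name: 80 for name in GK_WEIGHTS}

print(weights_sum_to_one(GK_WEIGHTS))
print(weighted_overall(example_keeper, GK_WEIGHTS))
```

A check like this is worth running for every position, because a table that sums to slightly more or less than 1 scales that position's ratings up or down. As written in `SortPositions`, the RW/LW weights add up to 1.04 and the CDM weights to 0.99.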
In soccer, each of the 11 players on the field has a position. The following image does a good job of showing where on the field each position is. As a quick guide, L is left, C is center, and R is right; vertically, GK is goalkeeper, B is back, M is midfielder, W is wing, and F is forward.

The second image shows how the positions are grouped (basically, a position ending in B is a defensive player, M is a midfielder, and W, F, or ST is an attacker). This is how we will split the positions for our analysis, as each group has a different focus. Although each position needs different skills to be successful, each type of player (defender, midfielder, attacker, and goalkeeper) has enough in common with similar players that we won't split the players any further by individual position.

These are the new dataframes by position we will use to store the data for each of the players.
attackers = pd.DataFrame(columns = fifa20.columns)
midfielders = pd.DataFrame(columns = fifa20.columns)
defenders = pd.DataFrame(columns = fifa20.columns)
goalkeepers = pd.DataFrame(columns = fifa20.columns)
This function takes an id, looks the player up in each of the five years (2015-2019), and collects the data into lists so that each player's progression can be displayed.
def aggregation(df, sid):
    # Note: the df argument is unused; the yearly tables are read from the global dataframes
    yearly = [d.loc[d['sofifa_id'] == sid] for d in (fifa15, fifa16, fifa17, fifa18, fifa19)]
    addDict = {}
    for i in fifa20.columns:
        if i == "short_name":
            # A single name per player, taken from the 2019 data
            addDict[i] = list(yearly[-1][i])[0]
        else:
            # One list per year; flattened below
            addDict[i] = [list(d[i]) for d in yearly]
    addDict["year"] = [[2015], [2016], [2017], [2018], [2019]]
    # Flatten the per-year lists into one list per column
    for keys in addDict.keys():
        if type(addDict[keys]) == list:
            addDict[keys] = [val for sublist in addDict[keys] for val in sublist]
    return addDict
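The same per-player lookup can also be sketched against one long table that stacks every year, which avoids querying each yearly dataframe separately. This is only an illustration on made-up data; `all_years` and `player_history` are hypothetical names.

```python
import pandas as pd

# Two tiny made-up yearly tables in the same shape as the FIFA data
y15 = pd.DataFrame({"sofifa_id": [1, 2], "short_name": ["A", "B"],
                    "overall": [70, 65], "year": 2015})
y16 = pd.DataFrame({"sofifa_id": [1, 2], "short_name": ["A", "B"],
                    "overall": [74, 66], "year": 2016})

# Stack the years into a single long table
all_years = pd.concat([y15, y16], ignore_index=True)

def player_history(frame, sid):
    """All rows for one player, ordered by year."""
    return frame.loc[frame["sofifa_id"] == sid].sort_values("year")

hist = player_history(all_years, 1)
print(hist[["year", "overall"]])
```

With the data in this long form, the progression of a player is just a filtered slice, and the year column travels with each row instead of being rebuilt by hand.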
The following code sorts each player into their respective position group. Since there is a lot of information, I have decided to stick with the top 100 players in each group, as I feel that is a large enough sample to get statistics and data about players' growth into world-class players.
unique_players = list(fifa19.sort_values(by="overall", ascending=False)["sofifa_id"])

def history(sid):
    # All rows for one player, newest year first
    # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
    return pd.concat([d.loc[d['sofifa_id'] == sid] for d in (fifa19, fifa18, fifa17, fifa16, fifa15)],
                     ignore_index=True)

aCount = 0
mCount = 0
dCount = 0
gCount = 0
# Walk down the 2019 ranking and fill each group until it holds 100 players
for x in range(0, 5000):
    sid = unique_players[x]
    pos = list(fifa19.loc[fifa19['sofifa_id'] == sid]["player_positions"])[0]
    if pos in ("CF", "ST", "LW", "RW") and aCount < 100:
        attackers = pd.concat([attackers, history(sid)], ignore_index=True)
        aCount += 1
    elif pos in ("CAM", "CM", "CDM", "RM", "LM") and mCount < 100:
        midfielders = pd.concat([midfielders, history(sid)], ignore_index=True)
        mCount += 1
    elif pos in ("LWB", "RWB", "LB", "RB", "CB") and dCount < 100:
        defenders = pd.concat([defenders, history(sid)], ignore_index=True)
        dCount += 1
    elif pos == "GK" and gCount < 100:
        goalkeepers = pd.concat([goalkeepers, history(sid)], ignore_index=True)
        gCount += 1
attackers.head()
Here we can see how players have improved or gotten worse over time, and we can compare different statistics across the years.
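The position buckets used for this sorting can also be written as a lookup table, with the top players per bucket picked by a sort followed by a grouped `head`. The following is a minimal sketch on made-up data; `POSITION_GROUP`, `top_n_per_group`, and the toy players are hypothetical, and the group membership mirrors the position lists used in the sorting code.

```python
import pandas as pd

# Position -> group lookup, mirroring the position lists used in the sorting loop
POSITION_GROUP = {}
POSITION_GROUP.update({p: "attacker" for p in ("CF", "ST", "LW", "RW")})
POSITION_GROUP.update({p: "midfielder" for p in ("CAM", "CM", "CDM", "RM", "LM")})
POSITION_GROUP.update({p: "defender" for p in ("LWB", "RWB", "LB", "RB", "CB")})
POSITION_GROUP["GK"] = "goalkeeper"

# Made-up players standing in for one year's table
players = pd.DataFrame({
    "sofifa_id": [1, 2, 3, 4, 5],
    "player_positions": ["ST", "CM", "CB", "GK", "ST"],
    "overall": [90, 88, 85, 89, 70],
})

def top_n_per_group(df, n):
    """Label each player with a group and keep the n best-rated players per group."""
    ranked = df.assign(group=df["player_positions"].map(POSITION_GROUP))
    ranked = ranked.dropna(subset=["group"]).sort_values("overall", ascending=False)
    return ranked.groupby("group").head(n)

top = top_n_per_group(players, 1)
print(top)
```

Players whose position is not in the lookup (RF and LF, for instance) get no group and are dropped, which matches how the loop skips them.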
Throughout this project, the major assumption is that world-class players share some trend, which would allow a model to predict how youngsters will develop. To check this, we will look at the data visually for the different groups of players, starting with the attackers, and see if there are any trends in improvement over time.
We will mainly use line plots and distribution plots to look at our data. The code for those two types of graph is below:
def alinegraph(df):
    dfline = sns.lineplot(x = "year", y = "overall", hue = "short_name", data = df)
    coord = dfline.get_position()
    dfline.set_position([coord.x0, coord.y0, coord.width * 3, coord.height * 3])
    # Moving the legend away from being placed on top of the graph
    dfline.legend(loc='center right', bbox_to_anchor=(1.2, .5))
    # Setting the title and axis
    dfline.set_title("Overall Rating Over Time")
    dfline.set(ylabel = "Overall Rating", xlabel = "Year")
    # Starting a fresh figure so the next plot does not draw on top of this one
    plt.figure()

def adistgraph(df):
    sidVals = df["sofifa_id"].unique()
    change = []
    # Rating change from 2015 to 2019 for every player present in both years
    for x in sidVals:
        beg = list(fifa15.loc[fifa15["sofifa_id"]==x]["overall"])
        end = list(fifa19.loc[fifa19["sofifa_id"]==x]["overall"])
        if len(beg) > 0 and len(end) > 0:
            oChange = int(end[0]) - int(beg[0])
            change.append(oChange)
    # Note: distplot is deprecated in newer seaborn releases (histplot/displot replace it)
    adist = sns.distplot(change)
    coord = adist.get_position()
    adist.set_position([coord.x0, coord.y0, coord.width * 2, coord.height * 2])
    # Setting the title and axis
    adist.set_title("Distribution of Change over 5 Years")
    adist.set(ylabel = "Frequency", xlabel = "Overall Change")
    # Starting a fresh figure so the next plot does not draw on top of this one
    plt.figure()
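The rating change that `adistgraph` collects player by player can also be computed in one step by merging the first and last year on the player id. The following is a minimal sketch on made-up numbers; `rating_change` and the two toy tables are hypothetical.

```python
import pandas as pd

# Made-up first-year and last-year ratings; player 3 and player 4 appear in only one year
first = pd.DataFrame({"sofifa_id": [1, 2, 3], "overall": [70, 65, 80]})
last = pd.DataFrame({"sofifa_id": [1, 2, 4], "overall": [78, 64, 90]})

def rating_change(first_year, last_year):
    """Change in overall rating for players present in both years."""
    # An inner merge keeps only ids found in both tables
    merged = first_year.merge(last_year, on="sofifa_id", suffixes=("_start", "_end"))
    merged["change"] = merged["overall_end"] - merged["overall_start"]
    return merged[["sofifa_id", "change"]]

changes = rating_change(first, last)
print(changes)
```

The inner merge plays the role of the `len(beg) > 0 and len(end) > 0` check: players missing from either year simply do not appear in the result.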
Attacking positions are among the most interesting to look at. Outside of the elite of the elite (Messi and Ronaldo, for example), attacking careers tend to be short-lived because attackers have to be fast and accurate, which makes it a position that battles time. In addition, attackers' stats tend to go up and down from season to season: a player may be amazing one season, and then a change in team, style, or an injury can derail an attacker more than any other type of player. From these basic assumptions, the graph should be relatively varied, with large changes in rating over the 5 years, as attacking players tend to get really good really quickly and then slowly fall away.
attackers.overall = pd.to_numeric(attackers.overall)
alinegraph(attackers)
The graph roughly bears out the assumption made at the beginning. One way to read it is to notice that the spectrum of colors represents the final rating: for example, red marks the most elite players, who end at the top, while green is the middle of the pack. By looking at the concentration of colors each year, we can see how much ratings change. Just looking at the graph, there are very few large increases (few to no red or orange lines that are low in 2015), but most players do seem to improve relatively rapidly compared to what would be expected in other positions. Most players ranked lower in 2015 can make it to the middle of the pack by 2019, but almost none make it past that point.
The following graph better shows the distribution of overall rating change over time for the attackers.
adistgraph(attackers)
From this graph we can see that the rating rises by at least 5 for almost a quarter of players. The outliers are those who improve by more than 15.
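Shares like these can be read directly off the list of rating changes rather than estimated from the plot. The following is a minimal sketch with made-up changes; `share_at_least` is a hypothetical helper, and the numbers are not taken from the FIFA data.

```python
import numpy as np

def share_at_least(changes, threshold):
    """Fraction of players whose rating change is at least the threshold."""
    changes = np.asarray(changes)
    return float(np.mean(changes >= threshold))

# Made-up rating changes for eight players
toy_changes = [0, 2, 5, 7, 16, -1, 3, 4]

print(share_at_least(toy_changes, 5))   # share improving by 5 or more
print(share_at_least(toy_changes, 15))  # share improving by 15 or more
```

Applied to the `change` list built inside `adistgraph`, the same function would give exact figures for each position group.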
But it is still hard to see, so the following graphs split the top 100 players into groups of 20 to show whether there are large variations in overall rating, as well as the distribution of change, within each group.
a1 = attackers[:100]
a2 = attackers[100:198]
a3 = attackers[198:298]
a4 = attackers[298:393]
a5 = attackers[393:]
alinegraph(a1)
adistgraph(a1)
alinegraph(a2)
adistgraph(a2)
alinegraph(a3)
adistgraph(a3)
alinegraph(a4)
adistgraph(a4)
alinegraph(a5)
adistgraph(a5)
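The row offsets used for these splits are picked by hand, since players who are missing a year contribute fewer than five rows. A sketch of an alternative that splits by player id, so the boundaries never cut a player in half, is shown here on made-up data; `split_players_into_groups` and the toy table are hypothetical.

```python
import pandas as pd

def split_players_into_groups(df, n_groups):
    """Split rows into groups of whole players, keeping the order of first appearance."""
    ids = list(pd.unique(df["sofifa_id"]))
    size = -(-len(ids) // n_groups)  # ceiling division: players per group
    groups = []
    for start in range(0, len(ids), size):
        chunk = ids[start:start + size]
        groups.append(df[df["sofifa_id"].isin(chunk)])
    return groups

# Ten made-up players with two yearly rows each, except player 3 who has one
rows = [(pid, year) for pid in range(1, 11) for year in (2019, 2018)]
rows.remove((3, 2018))
toy = pd.DataFrame(rows, columns=["sofifa_id", "year"])

groups = split_players_into_groups(toy, 5)
print([len(g) for g in groups])
```

Each returned group can then be passed to `alinegraph` and `adistgraph` in the same way as `a1` through `a5`.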
From top to bottom, the graphs run from the top fifth of the top 100 attacking players to the bottom fifth. There's a lot to analyze in each of these graphs, but I will highlight only a few important points. In the first and second sets (the top 40% of the top 100), there is a lot more variance, while the next 40% have a more uniform distribution. This will be important later in our model: to become the best of the best, a player may make large jumps, while a world-class but not best-in-class player should have an overall rating and metrics similar to where other players of that caliber were 5 years ago. In the bottom 20%, most players were already around that level 5 years prior, with only about 5% having been very low-rated players who made the jump.
The midfield positions define a team. The midfield decides whether the team favors defensive or attacking tactics, depending on whether it plays higher up the pitch or sits lower and defends. It is the identity of the team. Overall, this group will probably see less variation in change, as most midfielders share common abilities such as being good passers and dribblers and having good vision (an overall understanding of what is happening on the field at all times).
midfielders.overall = pd.to_numeric(midfielders.overall)
alinegraph(midfielders)
adistgraph(midfielders)
As predicted, the distribution is strongly peaked here, with an average increase of around 5-7. This suggests that identifying a world-class midfielder may be easier, as they all tend to follow a similar trend. The goal then would be to identify the world-class talents who can increase 15 or more in rating (fewer than 8% of players, according to the distribution graph).
Next, we look at the smaller subsets of this dataset, as we did with the attackers.
a1 = midfielders[:99]
a2 = midfielders[99:198]
a3 = midfielders[198:300]
a4 = midfielders[300:393]
a5 = midfielders[393:]
alinegraph(a1)
adistgraph(a1)
alinegraph(a2)
adistgraph(a2)
alinegraph(a3)
adistgraph(a3)
alinegraph(a4)
adistgraph(a4)
alinegraph(a5)
adistgraph(a5)
Overall, there is nothing too out of the ordinary. Most of these graphs have a very similar distribution and general trend in the lines. It is important to note that the graph with the strongest upward trend is the top 20% of midfield players, suggesting that the most elite midfielders may have a defining trait.
Defending is one of those positions that has really changed over the years. In the past, defenders were tough and burly; before every touch was a foul, they could almost barrel through their opponents. The modern defender has to worry more about decision making and positional awareness. This may be harder to quantify through a simple look at ratings and may require in-game statistics such as decisions made under different scenarios. Because of this, I have no predictions for what this graph will show.
defenders.overall = pd.to_numeric(defenders.overall)
alinegraph(defenders)
adistgraph(defenders)
The first thing that stands out is that, in the distribution, the percentage of players who increase by 15 or more is a lot higher than in any of the other datasets. This should imply that a future world-class defender is a lot easier to find, since the chance of a defender improving by a significant amount is very high. This is especially the case as the line graph shows that the range of the players' ratings is very small, with fewer than a handful able to distinguish themselves from the rest.
Taking a look at the fifth splits:
a1 = defenders[:100]
a2 = defenders[100:201]
a3 = defenders[201:302]
a4 = defenders[302:397]
a5 = defenders[397:]
alinegraph(a1)
adistgraph(a1)
alinegraph(a2)
adistgraph(a2)
alinegraph(a3)
adistgraph(a3)
alinegraph(a4)
adistgraph(a4)
alinegraph(a5)
adistgraph(a5)
These are probably the most interesting fifth splits! Looking at the line graphs, the ratings of the players are distributed almost evenly among the splits, apart from the top 20%. That is, the bottom 20% are all at about the same level, the next 20% are at the same level as each other but slightly better, and so on. It is also interesting to see that the starting ratings are all over the place for each fifth but then converge on one point. This backs up the earlier analysis that defenders have a higher probability of improving drastically than players in other positions, and this pattern appears to be uniform across all the fifths.
Goalkeeping is the one position that gets a category to itself. It is the position with the greatest longevity, with goalkeepers playing well into their 40s, but it also has one of the clearest lines between an elite player and a mediocre one. There aren't many world-class goalkeepers, which makes the ability to find one while they are young very important. But it isn't over if a really good goalkeeper isn't found early, as goalkeepers have a longer time to improve.
goalkeepers.overall = pd.to_numeric(goalkeepers.overall)
alinegraph(goalkeepers)
adistgraph(goalkeepers)
Not as many goalkeepers experience dramatic growth as defenders do, but a good portion still do, and a good number see decent growth. The range of ratings, however, varies greatly among the top 100 goalkeepers, more so than in any other position.
Let's look at the fifth splits:
a1 = goalkeepers[:98]
a2 = goalkeepers[98:201]
a3 = goalkeepers[201:301]
a4 = goalkeepers[301:396]
a5 = goalkeepers[396:]
alinegraph(a1)
adistgraph(a1)
alinegraph(a2)
adistgraph(a2)
alinegraph(a3)
adistgraph(a3)
alinegraph(a4)
adistgraph(a4)
alinegraph(a5)
adistgraph(a5)
My analysis of the goalkeeper position is very similar to that of the defenders, as the line graphs and the distribution graphs are very similar. The most radical change occurs between the bottom 20% and the bottom 40%, meaning that goalkeepers of other calibers tend not to go through such drastic change.
With these general trends in mind, and a better understanding of how position influences the way overall rating should be viewed, we will now begin to build our data model.